In today’s session, we will cover

  1. what multiple regression is

  2. how to build a multiple regression model in R

  3. how to plot a multiple regression model

  • In multiple regression, we attempt to predict a dependent or response variable y on the basis of an assumed linear relationship with several independent variables \(x_1\), \(x_2\), …, \(x_k\) (see the lm() sketch after this list).
  • \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + e \]
  • y is the dependent variable; \(x_1\), \(x_2\), …, \(x_k\) are the independent variables
  • k is the number of independent variables
  • \(\beta_0\), \(\beta_1\) …, \(\beta_k\) are regression coefficients
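
In R, a model of this form is fitted with lm(). Below is a minimal sketch using hypothetical names (a data frame d with a response y and predictors x1 and x2; these are placeholders, not the iconicity data used later):

# hypothetical data frame `d` with a response y and predictors x1, x2
fit <- lm(y ~ x1 + x2, data = d)
summary(fit)  # estimates correspond to beta_0, beta_1, beta_2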

Example with data

We will use the data from the paper “Which words are most iconic? Iconicity in English sensory words” (Winter et al., 2017).

  • Do word forms resemble their meanings?

    This paper investigated the iconicity of English words in relation to systematicity and the sensory properties of words. 3,001 English words were rated by native English speakers.

  • Iconicity as a function of sensory experience

Load the data

library(tidyverse)
library(broom)

rawdata<- read_csv("/Users/qiyi/Github_repository/R Group/Ch6.1-6.2/winter_2017_iconicity.csv")
rawdata %>% print(n=10, width=Inf) # width = Inf is to see all the columns of the tibble
# A tibble: 3,001 × 8
   Word      POS           SER CorteseImag  Conc  Syst    Freq Iconicity
   <chr>     <chr>       <dbl>       <dbl> <dbl> <dbl>   <dbl>     <dbl>
 1 a         Grammatical NA             NA  1.46    NA 1041179     0.462
 2 abide     Verb        NA             NA  1.68    NA     138     0.25 
 3 able      Adjective    1.73          NA  2.38    NA    8155     0.467
 4 about     Grammatical  1.2           NA  1.77    NA  185206    -0.1  
 5 above     Grammatical  2.91          NA  3.33    NA    2493     1.06 
 6 abrasive  Adjective   NA             NA  3.03    NA      23     1.31 
 7 absorbent Adjective   NA             NA  3.1     NA       8     0.923
 8 academy   Noun        NA             NA  4.29    NA     633     0.692
 9 accident  Noun        NA             NA  3.26    NA    4146     1.36 
10 accordion Noun        NA             NA  4.86    NA      67    -0.455
# ℹ 2,991 more rows
  • SER - sensory experience, Syst - systematicity, Freq - word frequency, CorteseImag - imageability, Conc - concreteness

  • Sensory experiences were rated from 1 to 7. Five common senses were considered (taste, touch, sound, sight and smell)

  • Iconicity was rated from -5 to 5

range(rawdata$Iconicity, na.rm=TRUE)
[1] -2.800000  4.466667
# count missing (NA) values per column
missing_imag <- sum(is.na(rawdata$CorteseImag))
missing_SER <- sum(is.na(rawdata$SER))
missing_Conc <- sum(is.na(rawdata$Conc))
missing_syst <- sum(is.na(rawdata$Syst))
missing_freq <- sum(is.na(rawdata$Freq))
missing_iconicity <- sum(is.na(rawdata$Iconicity))

missing_imag
[1] 1828
missing_SER
[1] 1222
missing_Conc
[1] 181
missing_syst
[1] 1898
missing_freq
[1] 53
missing_iconicity
[1] 0
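The same counts can also be collected in a single call; a minimal sketch using dplyr's across() on the same rawdata tibble:

rawdata %>%
  summarise(across(c(CorteseImag, SER, Conc, Syst, Freq, Iconicity),
                   ~ sum(is.na(.x))))  # one NA count per column

Only Iconicity is complete; lm() drops rows with missing values in any predictor, which is why the model summaries below report observations deleted due to missingness.
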
# round Iconicity to 3 decimals, then keep the row(s) with the highest rating
most_iconic <- rawdata %>% mutate(Iconicity = round(Iconicity, 3)) %>% 
  filter(Iconicity == max(Iconicity))
# ... and the row(s) with the lowest rating
least_iconic <- rawdata %>% mutate(Iconicity = round(Iconicity, 3)) %>% 
  filter(Iconicity == min(Iconicity))


most_iconic
# A tibble: 1 × 8
  Word    POS     SER CorteseImag  Conc  Syst  Freq Iconicity
  <chr>   <chr> <dbl>       <dbl> <dbl> <dbl> <dbl>     <dbl>
1 humming Verb     NA          NA  4.17    NA   251      4.47
least_iconic
# A tibble: 1 × 8
  Word      POS     SER CorteseImag  Conc  Syst  Freq Iconicity
  <chr>     <chr> <dbl>       <dbl> <dbl> <dbl> <dbl>     <dbl>
1 dandelion Noun     NA          NA     5    NA    15      -2.8
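
An equivalent, more concise way to pull out these rows is slice_max()/slice_min() (a sketch assuming dplyr >= 1.0.0):

rawdata %>% slice_max(Iconicity, n = 1)  # most iconic word
rawdata %>% slice_min(Iconicity, n = 1)  # least iconic word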

Pre-processing the Data

Plot the distribution of the raw frequency data

hist(
  rawdata$Freq,
  main = "Histogram of Raw Frequency",
  xlab = "Frequency",
  ylab = "Count",
  col = "#A376A2",
  border = "#DDC3C3"
)

  • The raw frequency data are heavily right-skewed, so we log-transform the frequency values before modelling

Log transform frequency

rawdata <- mutate(rawdata, logfreq = log10(Freq))
hist(
  rawdata$logfreq,
  main = "Histogram of Log Frequency",
  xlab = "Log10(Freq)",
  ylab = "Count",
  col = "#A376A2",
  border = "#DDC3C3"
)
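
Since the tidyverse is already loaded, the same histogram can also be drawn with ggplot2 (the bin count below is an arbitrary choice):

ggplot(rawdata, aes(x = logfreq)) +
  geom_histogram(bins = 30, fill = "#A376A2", colour = "#DDC3C3") +
  labs(title = "Histogram of Log Frequency", x = "Log10(Freq)", y = "Count")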

Build the model

model1 <- lm(Iconicity ~ SER + CorteseImag + Syst + logfreq, data = rawdata)


glance(model1)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.212         0.209  1.00      66.4 9.79e-50     4 -1403. 2817. 2846.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>


summary(model1)

Call:
lm(formula = Iconicity ~ SER + CorteseImag + Syst + logfreq, 
    data = rawdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.07601 -0.71411 -0.03824  0.67337  2.76066 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.54476    0.18795   8.219 6.43e-16 ***
SER           0.49713    0.04012  12.391  < 2e-16 ***
CorteseImag  -0.26328    0.02500 -10.531  < 2e-16 ***
Syst        401.52431  262.90268   1.527    0.127    
logfreq      -0.25163    0.03741  -6.725 2.97e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.002 on 984 degrees of freedom
  (2012 observations deleted due to missingness)
Multiple R-squared:  0.2125,    Adjusted R-squared:  0.2093 
F-statistic: 66.36 on 4 and 984 DF,  p-value: < 2.2e-16

Check R-squared

The model explained about 21% of the variation in iconicity ratings.

glance(model1)$r.squared
[1] 0.2124559
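
glance() returns the adjusted value in the same way; it is the more conservative figure when comparing models with different numbers of predictors:

glance(model1)$adj.r.squared  # about 0.209, matching the summary() output above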

Check coefficients

We can use round() to round the coefficients, which makes them easier to interpret.

  • round() takes two arguments: (1) a vector of numbers and (2) the number of decimal places to round to.
tidy(model1) %>% select(term, estimate) %>% mutate(estimate = round(estimate, 2))
# A tibble: 5 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)     1.54
2 SER             0.5 
3 CorteseImag    -0.26
4 Syst          402.  
5 logfreq        -0.25

Based on the estimates, a predictive equation can be expressed as:

\[ Iconicity = 1.54 + 0.50 \cdot SER - 0.26 \cdot CorteseImag + 401.52 \cdot Syst - 0.25 \cdot logfreq \]
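
To sanity-check the equation, you can plug a set of predictor values into predict() and compare it with the hand calculation. The values below are hypothetical, not a word from the data set, and because the coefficients above are rounded the two numbers will agree only approximately:

new_word <- tibble(SER = 3, CorteseImag = 4, Syst = 0, logfreq = 2)
predict(model1, newdata = new_word)
# hand calculation: 1.54 + 0.50*3 - 0.26*4 + 401.52*0 - 0.25*2 = 1.5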

  • What is the strongest predictor according to this result?

  • What does a one-unit change mean for systematicity?

Caution

A one-unit change in systematicity corresponds to a 401.5-point change in the iconicity rating. 😱


range(rawdata$Syst, na.rm=TRUE)
[1] -0.000481  0.000641
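
The scary-looking coefficient is an artifact of the tiny scale of Syst. Multiplying the coefficient by the observed range of systematicity gives the effect across that whole range, which is quite modest:

# effect of moving across the full observed range of systematicity
401.52 * diff(range(rawdata$Syst, na.rm = TRUE))
# roughly 0.45 on the iconicity scale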

Standardization!!

To make the independent variables more comparable!

  • scale() centers and z-scores a variable (a plain-vector alternative is sketched after the code below)
st_data <- mutate(rawdata,
                 SER_st = scale(SER),
                 CorteseImag_st = scale(CorteseImag),
                 Syst_st = scale(Syst),
                 Freq_st = scale(logfreq))  # standardizes the log-transformed frequency
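
scale() returns a one-column matrix rather than a plain numeric vector. If that ever gets in the way, a small helper gives essentially the same z-scores as ordinary vectors (a sketch, not how the analysis above was run; st_data_v is a hypothetical alternative name):

z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
st_data_v <- mutate(rawdata,
                    SER_st = z_score(SER),
                    CorteseImag_st = z_score(CorteseImag),
                    Syst_st = z_score(Syst),
                    Freq_st = z_score(logfreq))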

Re-build the model

model2 <- lm(Iconicity ~ SER_st + CorteseImag_st + Syst_st + Freq_st, data = st_data)
summary(model2)

Call:
lm(formula = Iconicity ~ SER_st + CorteseImag_st + Syst_st + 
    Freq_st, data = st_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.07601 -0.71411 -0.03824  0.67337  2.76066 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.25694    0.03612  34.801  < 2e-16 ***
SER_st          0.51550    0.04160  12.391  < 2e-16 ***
CorteseImag_st -0.39364    0.03738 -10.531  < 2e-16 ***
Syst_st         0.04931    0.03229   1.527    0.127    
Freq_st        -0.26004    0.03867  -6.725 2.97e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.002 on 984 degrees of freedom
  (2012 observations deleted due to missingness)
Multiple R-squared:  0.2125,    Adjusted R-squared:  0.2093 
F-statistic: 66.36 on 4 and 984 DF,  p-value: < 2.2e-16

Important

The R-squared value DID NOT change!

Standardization DID NOT change the underlying model!

  • What does a one-unit change in sensory experience mean now?

  • It means: “For each one-standard-deviation increase in sensory experience, the iconicity rating increases by about 0.52” (the SER_st estimate above; see the ranking sketch below).
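
Because all predictors are now on the same scale, their coefficients can be compared directly. A quick way to rank them by absolute size (broom's tidy() again):

tidy(model2) %>%
  filter(term != "(Intercept)") %>%
  mutate(abs_estimate = abs(estimate)) %>%
  arrange(desc(abs_estimate))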

Plots

  • In the first plot (Residuals vs Fitted), we want the red line to be as horizontal as possible, indicating that the only variation left in the data is unexplained error.
  • The second plot (Normal Q-Q) checks whether the residuals are normally distributed.
  • The third plot (Scale-Location) shows how far individual values are from the regression line. Again we want a horizontal red line with no particular pattern.
  • The fourth plot (Residuals vs Leverage) checks for outliers and influential points (see the panel-layout sketch after the plot() call).
plot(model2)
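
plot() on an lm object draws these four diagnostic plots one at a time. To see them in a single 2 x 2 panel, you can temporarily change the base-graphics layout:

par(mfrow = c(2, 2))  # 2 x 2 panel layout
plot(model2)
par(mfrow = c(1, 1))  # restore the default single-plot layout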

Partial Regression

Plot the effect of one independent variable while controlling for the others

library(car)
avPlots(model2)
  • The slope of the line in the partial regression plot is the coefficient of the corresponding variable in the multiple regression model.

Reference

Note

Data for Statistics For Linguistics: An Introduction Using R can be downloaded here:

https://osf.io/34mq9/files/osfstorage